DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods
On how to conduct and interpret a hypothesis test for the difference between two population proportions, see pages 477–480 of Lock, R. H., Lock, P. F., Morgan, K. L., Lock, E. F., & Lock, D. F. (2021). Statistics: Unlocking the power of data. Wiley.
Data from a sample of 200 patients following admission to an adult intensive care unit (ICU) in the United States of America.
| Variables | |
|---|---|
| Status | A factor denoting whether the patient lived or died |
| Sex | A factor denoting the patient’s sex, male or female |
| … | … |
xtabs( ~ Sex + Status, data = icu.df) |>
  proportions("Sex") |>
  as.data.frame() |>
  barchart(Freq ~ Status, groups = Sex, data = _, origin = 0,
           main = "Status distribution by Sex",
           xlab = "Status", ylab = "Proportion",
           auto.key = list(title = "Sex", space = "right"))

If both population proportions, \(p_1\) and \(p_2\), are known (the ground “truths”, or parameters, that summarise all possible values we could observe):
The sampling distribution of the difference between two sample proportions, \(\hat{p}_1 - \hat{p}_2\), is
\[ \hat{p}_1 - \hat{p}_2 ~ \text{approx.} ~ \text{Normal} \! \left( \begin{array}{l} \mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2, \\ \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1\times(1-p_1)}{n_1} + \frac{p_2\times(1-p_2)}{n_2}} \end{array} \right) \]
The use of the \(\hat{p}_1 - \hat{p}_2\) subscripts is to make it clear that we are talking about the sampling distribution of \(\hat{p}_1 - \hat{p}_2\) and not the possible values we could observe
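A short simulation can make this sampling distribution concrete. The sketch below is illustrative only: the values of p1, p2, n1 and n2 are hypothetical and do not come from the ICU data. It draws many pairs of samples, computes the difference between the two sample proportions each time, and compares the mean and standard deviation of those differences with the formulas above.

```r
# Hypothetical parameters (assumed for illustration only)
p1 <- 0.40; n1 <- 150
p2 <- 0.30; n2 <- 120

set.seed(121)      # for reproducibility
n.sims <- 10000    # number of simulated pairs of samples

# One sample proportion per population per simulation, then their difference
p1.hat <- rbinom(n.sims, size = n1, prob = p1) / n1
p2.hat <- rbinom(n.sims, size = n2, prob = p2) / n2
diff.hat <- p1.hat - p2.hat

# Mean and standard deviation given by the sampling distribution
mu.theory <- p1 - p2
sigma.theory <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# The simulated and theoretical values should be close
c(sim.mean = mean(diff.hat), theory.mean = mu.theory)
c(sim.sd = sd(diff.hat), theory.sd = sigma.theory)
```

A histogram of diff.hat, e.g. hist(diff.hat), should look roughly bell-shaped for sample sizes like these.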
More on 4.
These heuristics are a consequence of relying only on the sampling distribution of \(\hat{p}_1 - \hat{p}_2\) for the method taught in DATAX121
The standard error of the difference between two sample proportions, \(\hat{p}_1 - \hat{p}_2\), is
\[ \text{se}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1\times(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2\times(1-\hat{p}_2)}{n_2}} \]
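where \(\hat{p}_1\) and \(\hat{p}_2\) are the two sample proportions, and \(n_1\) and \(n_2\) are the corresponding sample sizes.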
The production of wood pellets often goes “out of specification”. To improve the number of pellets that conform to specifications, the manufacturer experimented with two new methods of producing pellets and randomly sampled 100 pellets produced with each method.
38 out of 100 sampled pellets produced with Method A conformed to specifications, while 29 out of 100 sampled pellets produced with Method B conformed to specifications.
| Variables | |
|---|---|
| Count | An integer denoting the number of pellets within the group |
| Conform | A factor denoting whether the group of pellets conformed to specifications, Yes or No |
| Method | A factor denoting the method used to manufacture the group of pellets, A or B |
Can we trust a confidence interval for the difference between two underlying proportions?
\(\text{se}(\hat{p}_A - \hat{p}_B) = 0.0664 ~ (4 ~ \text{dp})\)
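The standard error quoted here can be reproduced directly from the counts in the case study. The following is a minimal sketch; the object names (x.A, n.A, and so on) are chosen for illustration and are not part of pellets.df.

```r
# Counts from the case study: pellets that conformed to specifications
x.A <- 38; n.A <- 100   # Method A
x.B <- 29; n.B <- 100   # Method B

# Sample proportions
p.A <- x.A / n.A
p.B <- x.B / n.B

# Standard error of the difference between the two sample proportions
se.diff <- sqrt(p.A * (1 - p.A) / n.A + p.B * (1 - p.B) / n.B)
round(se.diff, 4)   # 0.0664
```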
barchart(Count ~ Conform, data = pellets.df, groups = Method,
         origin = 0, xlab = "Conformed?",
         auto.key = list(title = "Method", space = "right"),
         main = "Distribution of conformation by Method")

\[ \hat{p}_1 - \hat{p}_2 \pm z^*_{1-\alpha/2} \times \text{se}(\hat{p}_1 - \hat{p}_2) \]
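where \(z^*_{1-\alpha/2}\) is the multiplier from the standard Normal distribution for a \((1-\alpha) \times 100\%\) confidence level.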
Recall that 38 out of 100 sampled pellets produced with Method A conformed to specifications, while 29 out of 100 sampled pellets produced with Method B conformed to specifications
Construct a 99% confidence interval for \(p_A - p_B\).
\[ \text{se}(\hat{p}_A - \hat{p}_B) = 0.0664 ~ (4 ~ \text{dp}) \]
The solution is (-0.08115218, 0.26115218)
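To see where this interval comes from, the calculation can be done step by step. This is a sketch of the "estimate ± multiplier × standard error" recipe using the pellet counts; the object names are illustrative.

```r
# Sample proportions from the pellet counts
p.A <- 38 / 100
p.B <- 29 / 100

# Standard error of the difference between the two sample proportions
se.diff <- sqrt(p.A * (1 - p.A) / 100 + p.B * (1 - p.B) / 100)

# Multiplier for a 99% confidence level
conf.level <- 0.99
alpha <- 1 - conf.level
z.star <- qnorm(1 - alpha / 2)   # about 2.576

# Estimate -/+ multiplier x standard error
ci <- (p.A - p.B) + c(-1, 1) * z.star * se.diff
ci
```

The two limits should agree with the solution stated above.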
# The R function to calculate it in one go
prop.test(x = c(38, 29), n = c(100, 100), correct = FALSE,
conf.level = 0.99)
2-sample test for equality of proportions without continuity correction
data: c(38, 29) out of c(100, 100)
X-squared = 1.818, df = 1, p-value = 0.1776
alternative hypothesis: two.sided
99 percent confidence interval:
-0.08115218 0.26115218
sample estimates:
prop 1 prop 2
0.38 0.29
For CS 8.1, the 99% confidence interval for the difference in the two methods’ underlying proportions was (-0.08115218, 0.26115218)
Suppose you want to compare proportions within the same sample
\[ \text{se}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 + \hat{p}_2 - (\hat{p}_1 - \hat{p}_2)^2}{n}} \]
— Wild & Seber (2000)
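As a rough illustration of the same-sample case, the sketch below assumes a single sample in which every individual falls into exactly one of several response categories. The counts are hypothetical and are not taken from Wild & Seber or the course data. The standard error used is \(\sqrt{\{\hat{p}_1 + \hat{p}_2 - (\hat{p}_1 - \hat{p}_2)^2\}/n}\), which accounts for the two proportions not being independent: a respondent in one category cannot be in the other.

```r
# Hypothetical single sample with several response categories
n  <- 500    # total sample size (assumed)
x1 <- 210    # number in category 1 (assumed)
x2 <- 160    # number in category 2 (assumed)

p1.hat <- x1 / n
p2.hat <- x2 / n

# Standard error for two proportions from the same sample
se.same <- sqrt((p1.hat + p2.hat - (p1.hat - p2.hat)^2) / n)

# For comparison only: the two-independent-samples formula,
# which does not apply here because the samples are not independent
se.indep <- sqrt(p1.hat * (1 - p1.hat) / n + p2.hat * (1 - p2.hat) / n)

c(same.sample = se.same, independent = se.indep)
```

With these counts the same-sample standard error is larger, because the two proportions are negatively related.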
Another statistic that summarises a dataset with two categorical variables.
The odds of an event compare the chance that the event happens to the chance that it does not. Odds are typically expressed using a phrase with the structure “a to b”, so a ratio is implied but not actually computed.
— Utts & Heckard (2015)
Suppose a sample contains 1000 individuals, of which 400 carry the gene for a disease
In general, the higher the odds, the more likely the event is to happen
\[ \widehat{\text{Odds}} = \frac{\hat{p}}{1-\hat{p}} \]
where \(\hat{p}\) is the sample proportion for the event of interest.
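For the sample described above (1000 individuals, of which 400 carry the gene), the estimated odds can be computed as follows. The object names are illustrative.

```r
# Sample of 1000 individuals, 400 of whom carry the gene
n <- 1000
carriers <- 400

# Sample proportion of carriers
p.hat <- carriers / n

# Estimated odds of carrying the gene
odds.hat <- p.hat / (1 - p.hat)
odds.hat   # about 0.667, i.e. odds of 2 to 3
```

Equivalently, the odds are the count with the gene divided by the count without it: 400 to 600, which simplifies to 2 to 3.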
The odds ratio is more suitable than the difference between two proportions when the two-way table summarises the outcome of an event. It compares the odds of an event for two different “groups”, e.g. ethnicities and regions (Utts & Heckard, 2015).
The odds values for the two categories being compared are computed as ratios, allowing us to describe how much more likely an event is in the first group compared to the second group
Suppose a sample contains 400 individuals from region A, of which 200 carry the gene for a disease and 600 individuals from region B, of which 200 carry the same gene for a disease.
\[ \widehat{\text{OR}} = \frac{\widehat{\text{Odds}}_1}{\widehat{\text{Odds}}_2} = \frac{\hat{p}_1 \times (1- \hat{p}_2)}{(1- \hat{p}_1) \times \hat{p}_2} \]
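where \(\hat{p}_1\) and \(\hat{p}_2\) are the sample proportions for the event in the first and second groups, respectively.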
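Using the region A and region B counts given above, the estimated odds ratio can be computed in two equivalent ways. This is a minimal sketch with illustrative object names.

```r
# Region A: 200 carriers out of 400; Region B: 200 carriers out of 600
p.A <- 200 / 400
p.B <- 200 / 600

# Estimated odds of carrying the gene in each region
odds.A <- p.A / (1 - p.A)   # 1, i.e. 1 to 1
odds.B <- p.B / (1 - p.B)   # 0.5, i.e. 1 to 2

# Odds ratio as a ratio of the two odds
or.hat <- odds.A / odds.B

# Same quantity via the cross-product form of the formula
or.hat.alt <- (p.A * (1 - p.B)) / ((1 - p.A) * p.B)

c(or.hat, or.hat.alt)   # both about 2
```

The estimated odds of carrying the gene in region A are about twice the estimated odds in region B.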
More on 1.
Inference on an odds ratio, OR, is more flexible than the method taught to infer \(p_1 - p_2\), as it is built on a formal statistical model (see DATAX221).
Recall that the data came from a sample of 200 patients.
Do we meet the assumptions for inference on an OR?
What are the potential consequences for not meeting the independence assumption?
1.1111111
In DATAX121, we will only focus on how to use R to construct the interval and how to interpret such an interval
# The following function from epitools calculates the 95% C.I. for OR
library(epitools)
oddsratio.wald(icu.tab, conf.level = 0.95)
$data
Status
Sex died lived Total
female 16 60 76
male 24 100 124
Total 40 160 200
$measure
odds ratio with 95% C.I.
Sex estimate lower upper
female 1.000000 NA NA
male 1.111111 0.5468526 2.257588
$p.value
two-sided
Sex midp.exact fisher.exact chi.square
female NA NA NA
male 0.7687874 0.8558382 0.7707773
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
The epitools package expects the data to be in the form of a two-way table (see T01: Summarising Data, Slides 44–49).
From the $measure R output: the 95% confidence interval for the odds ratio of the event, “patient died after admission to ICU”, for females compared to males was (0.5468526, 2.257588)
Note that the 95% confidence interval is asymmetrical about \(\widehat{\text{OR}}\)
This C.I. method is specifically for an odds ratio from a 2-by-2 table of counts
Two-way table
\[ \left[ \begin{array}{cc} a & b \\ c & d \end{array} \right] \]
\[ \log\left(\widehat{OR}\right) \pm z^*_{1-\alpha/2} \times \text{se}\left\{\log\left(\widehat{OR}\right)\right\} \]
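where \(\widehat{OR} = \dfrac{a \times d}{b \times c}\) and \(\text{se}\left\{\log\left(\widehat{OR}\right)\right\} = \sqrt{\dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d}}\).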
Non-examinable
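The interval reported by oddsratio.wald() for the ICU data can be reproduced by hand. The sketch below assumes the usual Wald standard error for a log odds ratio, \(\sqrt{1/a + 1/b + 1/c + 1/d}\), with the counts taken from the two-way table of Sex by Status shown earlier. The object names are illustrative.

```r
# Counts from the ICU two-way table (rows: Sex, columns: Status)
cell.a <- 16; cell.b <- 60    # female: died, lived
cell.c <- 24; cell.d <- 100   # male:   died, lived

# Estimated odds ratio (cross-product of the 2-by-2 table)
or.hat <- (cell.a * cell.d) / (cell.b * cell.c)

# Wald standard error on the log scale
se.log.or <- sqrt(1 / cell.a + 1 / cell.b + 1 / cell.c + 1 / cell.d)

# 95% interval on the log scale, then back-transformed with exp()
z.star <- qnorm(1 - 0.05 / 2)
ci.log <- log(or.hat) + c(-1, 1) * z.star * se.log.or
ci <- exp(ci.log)

or.hat   # about 1.111
ci       # should agree with the lower and upper limits in $measure
```

Because the interval is symmetric on the log scale and then exponentiated, it is not symmetric about the estimated odds ratio on the original scale.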
The following exemplars only have a context and a C.I. interpretation
Biologists studying crows will capture a crow, tag it, and release it. These crows seem to remember the scientists who caught them and will scold them later. A study to examine this effect with caveman masks found that crows scolded a person wearing a caveman mask in 158 out of 444 encounters with crows, whereas crows scolded a person in a neutral mask in 109 out of 922 encounters.
Let \(p_c\) be the proportion of scoldings when volunteers are wearing the caveman mask and \(p_b\) be the proportion of scoldings when volunteers are wearing the neutral mask
If we construct a 90% confidence interval for \(p_c - p_b\), we get \((0.197, 0.279)\)
We are 90% sure that the proportion of crows that will scold is between 0.197 and 0.279 higher if the volunteer is wearing the caveman mask than if he or she is wearing the neutral mask.
A survey of students in their final year of high school asked whether they had ever used marijuana. It was found that 515 out of 1146 males had used marijuana, and 445 out of 1120 females had used marijuana. A health researcher wanted to use these data to infer the odds ratio of marijuana use between males and females.
Let \(\text{Odds}_M\) be the odds that a male student had used marijuana and \(\text{Odds}_F\) be the odds that a female student had used marijuana
If we construct a 95% confidence interval for \(\text{OR} = \displaystyle \frac{\text{Odds}_M}{\text{Odds}_F}\), we get \((1.0477, 1.4629)\)
With 95% confidence, we estimate that the odds of having used marijuana for males are somewhere between 1.05 and 1.46 times the odds of having used marijuana for females